Project Title

INFO 526 - Summer 2025 - Final Project

This project examines past NCAA Men’s March Madness data to develop key insights and predictions through data visualization, using plot creation and interpretation.
Author: Matt Osterhoudt
Affiliation: School of Information, University of Arizona

Introduction

March Madness is the annual NCAA Division I men’s basketball tournament. It is a single-elimination tournament, and the data I will be using cover 2008-2024. 2020 is not included because the tournament was canceled due to COVID-19. Two data sets will be used: Team Results and Public Picks. Let’s start with the simpler one: the Public Picks data set contains the percentage of people who picked each team to win its game in the Round of 64, Round of 32, Sweet 16, Elite 8, Final Four, and championship game of the 2024 tournament.

The second data set, Team Results, contains data from 2008-2023. It has more variables, such as PAKE (Performance Against Komputer Expectations) and PASE (Performance Against Seed Expectations), along with the total number of tournament games each team has played and how often each team has reached the Round of 64, Round of 32, Sweet 16, Elite 8, Final Four, and championship game, and how often it has won the title. There are also two likelihood variables, f4percent and champpercent, which give the likelihood of a team reaching at least one Final Four or winning at least one championship.
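As a rough illustration of how PASE could be used to group teams, the sketch below bins teams into PASE quartiles with base R. It uses only the ten `teamid` and `pase` values visible in the Team Results output in this document, so the quartile cut points here will differ from ones computed on all 236 teams. The `Q1`-`Q4` labels mirror the `pase_quant` column in that output, but the binning rule itself is an assumption, not necessarily the method used in the analysis.

```r
# Ten teamid/pase pairs copied from the Team Results output in this document.
team_results <- data.frame(
  teamid = c(1, 2, 3, 4, 6, 8, 9, 10, 11, 12),
  pase   = c(0.7, -1.1, -2.9, -0.3, -0.4, -2.5, -1.9, 3.5, 0, 1.4)
)

# Quartile cut points of PASE (0%, 25%, 50%, 75%, 100%).
pase_breaks <- quantile(team_results$pase, probs = seq(0, 1, by = 0.25))

# Assumed rule: label each team by the PASE quartile it falls into.
# include.lowest = TRUE keeps the minimum value inside Q1.
team_results$pase_quant <- cut(
  team_results$pase,
  breaks = pase_breaks,
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE
)

print(team_results)
```

Teams in Q1 have underperformed their seed expectations the most, and teams in Q4 have exceeded them the most.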

Question 1:

How well does past performance from 2008-2023 correlate with predictions for the 2024 tournament?

Question 1 Introduction

Question 1 Approach

Question 1 Analysis

# A tibble: 236 × 22
   teamid team     pake pakerank  pase paserank games     w     l
    <dbl> <chr>   <dbl>    <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
 1      1 Abilen…   0.7       45   0.7       52     3     1     2
 2      2 Akron    -0.9      179  -1.1      187     4     0     4
 3      3 Alabama  -2.1      211  -2.9      220    10     5     5
 4      4 Albany   -0.4      147  -0.3      138     3     0     3
 5      6 Americ…  -0.5      160  -0.4      150     3     0     3
 6      8 Arizona  -1.7      206  -2.5      216    28    17    11
 7      9 Arizon…  -2        209  -1.9      206     5     1     4
 8     10 Arkans…   4.3       11   3.5       16    18    11     7
 9     11 Arkans…   0         76   0         78     1     0     1
10     12 Auburn    0.6       53   1.4       30    11     7     4
# ℹ 226 more rows
# ℹ 13 more variables: winpercent <dbl>, r64 <dbl>, r32 <dbl>,
#   s16 <dbl>, e8 <dbl>, f4 <dbl>, f2 <dbl>, champ <dbl>,
#   top2 <dbl>, f4percent <chr>, champpercent <chr>,
#   historical_f4_percentage <dbl>, pase_quant <fct>
# A tibble: 58 × 5
   team    historical_f4_percen…¹ pase_quant public_f4_percentage
   <chr>                    <dbl> <fct>                     <dbl>
 1 Akron                     0    Q1                         0.1 
 2 Alabama                   0    Q1                         2.89
 3 Arizona                   0    Q1                        10.0 
 4 Auburn                   25    Q4                         3.64
 5 Baylor                    9.09 Q2                         4.35
 6 BYU                       0    Q1                         0.64
 7 Clemson                   0    Q1                         0.3 
 8 Colgate                   0    Q2                         0.09
 9 Colleg…                   0    Q2                         0.1 
10 Colora…                   0    Q1                         0.11
# ℹ 48 more rows
# ℹ abbreviated name: ¹​historical_f4_percentage
# ℹ 1 more variable: delta_f4 <dbl>
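
One way to quantify the relationship asked about in Question 1 is a rank correlation between the historical and public Final Four percentages. The sketch below uses base R and the first seven rows of the second output above whose team names are not truncated, with values as displayed (rounded). The definition of `delta_f4` as public minus historical is an assumption, since the output does not show how that column was computed, and Spearman correlation is one reasonable choice here rather than the analysis's confirmed method.

```r
# Rows copied from the joined output above (values rounded as displayed).
f4_compare <- data.frame(
  team = c("Akron", "Alabama", "Arizona", "Auburn", "Baylor", "BYU", "Clemson"),
  historical_f4_percentage = c(0, 0, 0, 25, 9.09, 0, 0),
  public_f4_percentage     = c(0.1, 2.89, 10.0, 3.64, 4.35, 0.64, 0.3)
)

# Assumed definition: positive values mean the public rates a team
# higher than its 2008-2023 Final Four history would suggest.
f4_compare$delta_f4 <- f4_compare$public_f4_percentage -
  f4_compare$historical_f4_percentage

# Spearman rank correlation; ties (the many 0 values) get average ranks.
rho <- cor(
  f4_compare$historical_f4_percentage,
  f4_compare$public_f4_percentage,
  method = "spearman"
)

print(f4_compare)
print(rho)
```

A rank-based measure is used because most teams have a historical Final Four percentage of 0, which makes the distribution heavily skewed. With only seven rows, the resulting coefficient is illustrative and says little about the full 58-team comparison.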

Question 1 Discussion

Question 2:

something something?

Question 2 Introduction

Question 2 Approach

Question 2 Analysis

Question 2 Discussion